1 Before you start the lab

This lab will introduce you to metabarcoding. Investigating this “hidden diversity” often relies heavily on bioinformatics. Here, most of the analyses have already been carried out for you, and the focus is on interpreting the results of the analyses. All graphs in this practical have been generated in R. The code is visible for those interested (don’t worry if it does not make sense! The goal here is to interpret results) - but you can hide the code by clicking the Hide button above each code chunk. There are also a few dropdown boxes in the document that contain extra information, an example is shown below:

Dropdown box. Click me to get more information
More information


To complete this lab, answer all questions (in bold) and discuss them with your teaching assistant at the end of the lab.

Good luck! :)

1.1 Learning outcomes

After completing this lab, you will be able to:

  • Explain how DNA can be used to study microbial diversity in a high-throughput way and contrast it with using morphology
  • Use given sequences to BLAST against a reference database and identify the corresponding organism
  • Explain the concept of metabarcoding, and assess its power and limitations
  • Classify sequences based on phylogenetic placement
  • Analyse the ecology of taxa of interest using sequence information
  • Analyse and assess how different microbial communities are structured

1.2 Motivation

Microbes run the world! Protists and fungi play crucial ecological roles such as primary production, consumers and decomposers. But how do we study them given that they are so small? Furthermore, most of them are difficult to culture. One method is: using a metabarcoding approach!


Fig. 1: Metabarcoding overview (source:http://www.naturemetrics.co.uk)


Figure 1 shows an overview of the metabarcoding approach. The most commonly used DNA barcode for microbial eukaryotes is the 18S rDNA gene (though it should be noted that the internal transcribed spacer, ITS, is more commonly used for fungi). In this lab, we will focus on the 18S gene using subsets of published sequencing data from: global marinesampling expeditions such as Tara Oceans and the Malaspina Expedition, freshwater from Lake Baikal, and soils from Neotropical forests.

2 The data


In this section, you will work with a subset of data from the Tara Oceans Expedition (de Vargas et al. 2015, Science). For this lab, we have gathered sequences from two marine environments: Coral Reefs, and Open Ocean. Each environment is represented by three stations each, meaning that you will be working with data from 6 stations in total. The location of these stations is circled in the map in Figure 2.

Fig. 2: Tara Ocean sampling stations (adapted with permission from Decelle et al. 2018, Cell). Here, we will focus on the circled stations.


The Tara Oceans project used a region (V9) of the 18S gene (the Small Ribosomal Subunit gene in eukaryotes, see Figure 3 below) as a metabarcode.

Fig. 3: A cartoon of the 18S gene. The V4 and V9 regions are the most barcodes for microbial eukaryotes.


These sequences correspond to organisms living in surface water (~ 7 m depth) collected from the six stations. The water was size fractionated to recover biodiversity from major organismal size fractions: pico-nanoplankton (0.8 to 5 microns), nanoplankton (5 to 20 microns), microplankton (20 to 180 microns) and mesoplankton (180 to 2000 microns). However, in this lab we have merged all the size fractions for simplicity.

2.1 Getting a feel for your data

Normally, we start our data analysis with a LOT of reads or metabarcodes. The first step is to filter these reads based on quality and then cluster them into biologically meaningful Operational Taxonomic Units (OTUs). Here, in the interest of time, we have already cleaned the reads and clustered them into OTUs.

# read table containing metadata about stations
otus <- read.csv("data/metadata.txt", header=T, sep = "\t")
otus
##      station no_of_OTUs no_of_reads       type        time.date Temp..Celcius.
## 1  station45      16376     9777277 open_ocean 2010-04-13T03:21       30.59345
## 2  station50      10532     2742747 open_ocean 2010-05-09T05:30       27.54930
## 3 station126      31777     4930749 open_ocean 2011-08-28T18:05       26.68965
## 4 station121      11639     2396972      coral 2011-07-09T17:47       24.23274
## 5  station46      15309    15360520      coral 2010-04-15T02:01       30.12475
## 6  station49      12782     1722440      coral 2010-04-23T10:10       28.27290


Questions:
1. The Tara Oceans Project found ~110,000 OTUs in total from surface waters of tropical and temperate oceans. Pick any one of the six stations. What proportion of total species diversity is present at that station?

2.2 What do these OTUs represent anyway?

OTUs are often defined using sequence similarity thresholds (traditionally sequences less than 97% similar to each other are treated as belonging to different species, though other criteria are also frequently used). But what do these OTUs actually represent? Does OTU = species? Here we will evaluate what OTUs represent using several different examples of described species.

2.2.1 Scenario 1

You isolate two cells from soil samples in Nova Scotia, Canada and photograph them using contrast light microscopy (Figure 4).

Fig. 4: Two protist cells from soil, Canada (from Lax et al. 2018, Nature). Size of scale bar in a is 10 μm, and 5 μm in b.


Questions:
2. Based on morphology, do you think A and B are different species?

You then generate 18S sequences (provided below) for both cells to confirm their identity. You align them together to find how similar they are to each other. To do so, use the blast2seq website, and copy paste the sequences, one in each box. (NB: the following DNA sequences are in fasta format, a commonly used format for representing nucleotide and amino-acid sequences).

Click to read more about the fasta format
There are two types of lines in a fasta file. The fasta header line starts with “>” and contains a description of the sequence (here for example, the fasta headers are “>seq_A” and “>seq_B”). Subsequent lines (not starting with “>”) contain the sequence information.


>seq_A
GGTTCGATTCCGGAGAGGGAGCCTGAGAAATGGCTACCACATCTAAGGATGGCAGCAGGCGCGCAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATATCGATAGTCGCCTTTTTACAGGAGACTAATTGAAATGAGAACAATTTAAACCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCAGAATCGTGCGTGGCCTGTAACAGGGTCATTCAAAGATTCGGAGCTTGATTCGAATGAGTTGAGCTTTGGGTCACGATTCTTACCTATTATTCACGGCGGGGCAACTCACCGAGGATAATTCGTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCGTTCGCCATTCATTATGAATACATTAGCATGGAATAATAAAATAGGACTTTAGTTTTATTTTATTGGATCCTAGGACTAAAGTAATGATTAATAGGGACAGTTGGGGGTATTCGTATTCTTGAGTCAGAGGTGAAATTCTTAGATTTCAGGAAGACGAACTTCTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGATCGAAGACGATCAGATACCGTCGTAGTCTCAACCATAAACGATGCCGACTAGGGATCAGTGAATGTTTTATATTTGACTTCATTGGCACCTCCAGAGAAATCGCAAGTTTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAACTTACCAGGTCCAGACATAGCAAGGATTGACAGATTGAAAGACCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGCTTAATTGCGATAACGAACGAGACCTTAACCTACTAACTAGTCACACTAACCTTCCGAAATGGTGTTGGTGGCGGACTTCTTAGAGGGACAACGTGTGCTGAGCACGAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATACAATCAATGAGTTTCTACAAGACCTTAACTGATAAGTTTGGGTAATCTTTTGAAATTGTATCGTGATGGGGATAGATCATTGCAACTATTGATCTTCAACGAGGAATTCCTAGTAAGCGTAAGTCATCAGCTTACGTTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGGATGATTCGGTGAACTTTTCGGACCGTGATCTATTGTTGGGAAACCAGCGTGGATCG

>seq_B
GAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAAGCTGTGAAACAACGGGCTTGATAGTGGCCTGGCAACAGGACGCTAGTTACGCTCTGACCGTAGCGACACCTTCAAATTGCTGGAAGTCCCTAAAGCTCTTAGTACCGCGGCGCCTGGCAACAGAGCGTGCAGCAGGTGGTGTAGTGGCCACTGGGTGGTAACAATCTAAGAGATGACACAATGGGTAACCAGCAGCCAAGTGGCCTCTGGCCATGCAGTTCACAGACTAAATGTTGGTGGGCGACCTCTGGGTTGCTTAAGATATAGTCGGCCGGCCGCAGAAATGCGGTTCTCTAAGAGGAATCTTCACTGCGAGTCAGAGAGCACGCAGCAGTGGAGAGGAGAGCTTAGAGACAGCCAGCACTGGACCAGTGCCTCGGCTGCGATCAGCGGAGAAATGGCTACCACATCCAAGGATGGCAGCAGGCGCGCAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATATCAATAGTGGACTTTTTACAGTACACTAATTGAAATGAGAACAATTTAAATCCCTTATCGAGGAACAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGGAGGTCAGGCGTCGTAGGGCAATAGTTCTGCGGTTGTCGGCTTCTCTTACCTGTCTGGCATTGCCCTCGTTGGTGGTGCTAGGCTCGTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTACGCTGTGAATACATTAGCATGGAATAATAGAATAGGACTTTGGCCCTATTTTGTTGGTTTCTAGGACCGAAGTAATGATTAATAGGGACAGTTGGGGGCATTCGTATTTTACAGTCAGAGGTGAAATTCTTAGATTTGTGAAAGACGAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGACGATCAGATACCGTCGTAGTCTTAACCATAAACGATGCCAACTAGGGATGGGCAGATGTTAACTTTATGACTCTGTCCGCACCTCCAGAGAAATCGCAAGTTTT


Questions:
3. What is the percentage identity between the two sequences? Does this information change your answer to question 2? Why or why not?

2.2.2 Scenario 2

Imagine that you collected several marine samples from different areas. You observe samples A and B under the microscope and take the following images (Figure 5).

Fig 5. Cells in samples A and B. Images re-used with permission from Tikhonenkov et al. 2019, PLOS One.


Questions:
4. Would you describe as the same species or different species based on the information you have?
5. Using the blast2seq server, what is the percentage identity between the two sequences (provided below)? Does it change your answer to the previous question?

>seq_A
GGCTCCATATACTAGTGAGAACCTACCTACTCAGTCAACTACAAGGCTAACCTTGCCAACAGCAAGGCTAATACTTGACTAACCAGCTACCTTAGGGTAGCAGCGTAAGGTCTAGCTTGACAGCTACCTTACATAGTAAGCTACTAGCTTACTGGTTGACCTGTAAGGGCTAAGGGCTGCTACTAGTAGTAGCTCGACCTAAGGCTGAGTAGAGTGCTCTCTAAGCCATTAGCTGGTAGTAAGGGTAGCGTCCTTACTAGGCGACCATGGCTACGGGGAATCAGCGTTCGATTCCGGAGAGAAAGCCTGAGACACGGCTATCACATCTAAGGAAGGCAGCAGGCGCGCAAATTGCCCAATGCAGACTCTGCGAGGCAGTGAACAGCTATACTAGCCTACATACCTAGTATGTAGGGAGGCTAATGCCTTAGAATGAATGCTGCTTAGTAAACAGCATGATAAGCTAGTAGAGGATCAGTCTGGTGCCAGCACCCGCGGTAACTCCAGCTCTGCAAGCCTATGATAATACTGTTGTCTTTAAAACGTCCGTAGCTAGCTTAAGCTAGTTGACTGCTAGTCTGTAGTATAGGCTAGCTAGTATTGCCTTTAGCTAGGTAGTGCTAGCTAGACTAACTACCTAGGTGGCTAGCTATAGCTAAGGCTGTAGCTAGCTGTAGGCTAGTAGTCTACTAGTGTAGCGACATCGCTAGTAAACAGGCATGCTCAAGGTTAGCTAAAGGGGTTTCCCTATAGCTAGAACTACATAGAGCGACACGTCAATATAGCTGGCTACTAGCTTGTCTAGTAGCCGGCGTGTACAATGAAGGGGTCTCAGGCTAGGAGTATTAGTAGGCTAGGGGTGAAATCCAGTGACCCTACTAAGACTACATGAGGCGAAAGCGCCTAGCTAGAGCCTACTTGTCGGCGATGGACGAGAGTGAAGGGCGCGAAGATGATTAGATACCGTTGTAGTCTTCACTGTAAACTATGCCAGCTTGGCTAGGTAGCTTCACCTAGTTGAAGGCTACCTAGCCATAGGGAAACTAGTAAGCTCTTCGGCTAGAGGGGTAGTACTGTCGCAAGGCAGAAACTTAAAGGAATTGACGGAAAAGCACCACCAGGCGTGGAGTCTGCGGCTTAATTCGACTCAACACGGGGAACCTTACCAGGGCAGGACACTAAGTGGATTGTCAGCCTACAGGGCTTACATGATATTGTGGAAAGTGGTGCATGGCCATTCTTAGTTCGTGAGGCGACTTGTCTCCTTAATTGAGGTAACGAGCGAGACCCTGAAGGCTAGTTGAGCCTAGCTAGCTAGTCTAGCTAGTGTGCTACCTAGTAGCCGCTCCTAGCCTGATTACTAGCGCCTAGCTAGTCGAGACAGGACAATAATAGGTCTGTTATGCCCTTAGATGGCCTGGGACGCACGCGTACTACACTGACACTCGCAGCCAGTATAGTAGCTACTAGCTATCTAGCTAGTAGCTAAGCTATGCCGTAAGGCATTGCGAAGCTGCAAAGGGTGTCTAGGTAAGGATTGCTGCCTGCAAGGGCAGCATCAATGAGGAATGCCTTGTAAGTGCCTTTCAGCATAAGGCTCTGAATACGTCCCTGCTTTTTGTACACACCGCCCGTCGCATCAAGGAACTCAACTAGGCCAATGAGCCTGCTGGACGCAAGGAAAGCAGGCAAA

>seq_B
AACCTGGTTGATCCTGCCAGTACCATAGGCTAGTCTTGAAGACTAAGCCATGCATGTGTCAGTGCACAGCCTAGTTTACTAGGCATTCGGAAACTGCGGATGGCTCCATATACTAGTGAGAACCTGCCTGCCCAGTCAACTACAAGGATAACCTTGCCAACAGCAAGGCTAATACTTGATAAACAGCTAGCTTCGGCTAGTAGCGCAAGGCTAAGCTTGACGGCTACCTTGCATAGTAGGCTTCGGCTTACTGGTTGACCTGAAAGGATTAAGAGCTGCTACTAGTAGTAGCTCGACCTAAGGCTGAGCAGAGAGCTCTCTAAGCCATTAGCTGGTTGTGGGGGTAGCGGCTTCACAAGGCGACTATGGCTACGGGGAATCAGCGTTCGATTCCGGAGAGAAAGCCTGAGACACGGCTATCACATCTAAGGAAGGCAGCAGGCGCGCAAATTGCCCAATGCAGATTCTGCGAGGCAGTGAACAGCCATACCGGCCTGCGCACCTAGTGCGCAGGGAGGCTTTAAGCCTTAGAATGAATGCTGCTTAACAAACAGCATGATAAGCTAGTAGAGGATCAGTCTGGTGTTGTCTTTAAACGTCCGTAGCCAGCTAAAGCTGCTAAGCTACTAGTCTGCAGCACAGGCTAGCTTCAACTACCTTTCGCTAGGTAGTAGTTGCTAGACTGGCTGCCTTGGCGGAAGCCTTTAGCATAAGCTGAAGGCTACTGTAGGCTGGTAGTTTAGTAGCGTAGCGACATCGCTAGTAAACAGGCATGCTCAAGGTAAGCTAAAGGGTTTTCCCTATGGCTTGAACTATATAGAGCGACACGTCAGTCAAGCAGGCTACTTCGGTAGCCGGCGTGAACAATGAGGGGGACTCAGGCCAGGAGTATTGGCAGGCTAGGGGTGAAATCCAGTGACCCTGCCAAGACTGCATGAGGCGAAAGCGCCTGGCTAGAGCCTACTTGTCGGCGATGGACGAGAGTGAAGGGCGCGAAGATGATTAGATACCGTTGTAGTCTTCACTGTAAACTATGCCAGCTTGGCTGGGCAGCTTCACCTAGTTGAAGGCTGCCTGGCCAAAGGGAAACTAGTAAGCTCTTCGGCTAGAGGGGTAGTACTGTCGCAAGGCAGAAACTTAAAGGAATTGACGGAAAAGCACCACCAGGCGTGGAGTCTGCGGCTTAATTCGACTCAACACGGGAAACCTTACCAGGGCAGGACACTAAGTGGATTGTCAGCCTACAGGGCTTACATGATATTGTGGAAAGTGGTGCATGGCCATTCTTAGTTCGTGAGGCGACTTGTCTCCTTAATTGAGGTAACGAGCGAGACCCTGAAGGTTAGTTGAGCTTGGCTAGTTTACTAGTCAATGTGCTGCTTCGCAGCCGCTCCTAACCTGATTACTGGCGCCTAGCCAGTCGAGACAGGACAATAATAGGTCTGTTATGCCCTTAGATGGCCTGGGACGCACGCGTACTACACTGACGCTCGCAGCCAGTAGAGTAGCGGCTAGTTTACTAGCCGCTAAGCTATGCTGGAAAGCATTGCGAAGCTGCAAAGGGCGTCTCGGTAAGGATTGCTGCCTGCAAGGGCAGCATAAATGAGGAATGCCTTGTAAGCGCCCTTCAGCATAGGGCACTGAATACGTCCCTGCTTTTTGTACACACCGCCCGTCGCATCAAGGAACTCAACTGAGGCAATGAGCCTGCCGGACGCAAGGAAGGCAGGCGAATAGCCTTAGTCTTACCATGATGAAGTCGTAACAAGGTATCCGTAGGTGAACCTGCAGAAGGATCA


Other evidence
Tikhonenkov et al., 2019 classified these cells as different species belonging to the same genus (Percolomonas, Heterolobosia, Discoba). Apart from the low 18S sequence similarity, ecological evidence also supports the different-species-hypothesis. Cell A is found in brackish and low salinity environments, whereas Cell B is found in high salinity environments.


2.2.3 Scenario 3

Finally, imagine that you obtained planktonic samples from Lake Erken and from the Baltic. You isolate some similar looking cells and examine them with an electron microscope. In Figure 6, cell A is from Lake Erken while cell B is from the Baltic.

Fig 6. Images reused from Logares et al. 2007, Microbial Ecology


Questions:
6. Based on morphology and ecology, are these protists different species?
7. Compare the 18S sequences of Cell A (Accession number: AY970662) and Cell B (Accession number: AY970653) using Blast2seq. What is the percentage identity? What could explain these results?

2.2.4 Summary so far

Based on these three scenarios, what can we say about what an OTU/18S sequence represents? The answer is that it depends. There are cases where morphology and DNA sequences are in agreement (such as in Scenario 1). However, there are also cases where different species can have identical 18S sequences (such as in Scenario 3). There are also cases where the opposite can happen - i.e. 18S sequences from the same species (or even the same cell!) are very different from each other. Despite these limitations, metabarcoding is an incredibly important method for documenting the organisms present in a sample in a high-throughput way. For instance, it can often be difficult to identify organisms based on morphology (such as in Scenario 2), and by sequencing the 18S gene, we can more easily place organisms in their respective clades. Next, we will play with metabarcoding data from the six Tara Oceans datasets and investigate the microbial eukaryotic diversity in marine habitats.

3 Exploring microbial diversity


3.1 Identifying OTUs

We will start with identifying all of the OTUs in our two marine environments (coral reef and open ocean), and assigning them taxonomy. This is done by searching the OTUs against a reference database. BLAST the sequences below against the GenBank database.

>otu_1485;613
gtcgctactaccgattgaacgttttagtgagacatttggactgggtcagtgtaggctttcatgcct acgttgtctccggaaagactttcaaacttgagcgtttagaggaagtaaaagtcgtaacaaggtttcc
>otu_58431;9
gtcgcacctaccgattggatgtttcgatgaggccctcggaccgtggcacgtcaccttgactggcaa cgcgctttgggaagttgtccaaatctcaacatctagaggaaggtgaagtcgtaacaaggtttcc
>otu_5604;24
gtcgctcctaccgattgagtgatccggtgaattacttggattgcagcagggctcagtgtgtgaact ttgctggagaaatgtcatgaaccttattacttagaggaaggagaagtcgtaacaaggtttct


Questions:
8. How long are the nucleotide sequences (metabarcodes) on average?
9. Considering only the top hit, identify the three OTUs to species or genus level. Which supergroup/major lineage do they belong to? Search them on Google images to bond with them a little.
10. What role do these three organisms carry out in the oceans? (Briefly read on Wikipedia/somewhere else).
11. In our fasta files, the fasta header describes two things, the unique OTU identifier (before the “;”), and the number of reads in that OTU (number after the “;”). Given that information, which organism is the abundant out of the three?


To identify all of the thousands of OTUs in our datasets, we will use the same approach as above but using automated methods. Here, instead of searching against GenBank which contains a lot of mislabelled sequences, we will use a curated database called PR2. This is a curated reference dataset specfically for the eukaryotic 18S gene. As in de Vargas et al. 2015 (the Tara Oceans paper), we will assign OTUs to a taxonomic lineage if they are at least 80% similar to a reference sequence in PR2. If they are less than 80% similar, they are considered “unassigned”.

Let’s have a quick look at the reference database, PR2, modified to include only the V9 region for the Tara Oceans data. As you can see, the fasta headers contain the taxonomy of the respective sequences.

head data/Database_W2_v9_pr2.fasta
## >AF272045 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|36|Kareniaceae_12|Karlodinium_05|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272049 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|39|Kareniaceae_26|Karlodinium_10|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AM494500 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|38|Kareniaceae_16|Karlodinium_07|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272046 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|35|Kareniaceae_04|Karlodinium_03|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272050 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|48|Kareniaceae_39|Karlodinium_16|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc

3.2 Marine microbial diversity

3.2.1 Dominant supergroups in the chosen stations

For simplicity, we have already searched our OTUs against PR2. Now let’s have a broad overview of the organisms are present in our samples! In the plots below, coral reef stations will be coloured in shades of pink/coral and open ocean stations in shades of blue.

What organisms are present in our marine samples?

# load necessary packages
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read in data
stations <- read.csv("data/tara.otus.taxonomy.tsv", header=F, sep = "\t")
colnames(stations) <- c("group", "no_of_OTUs", "no_of_reads", "station")

# normalize data by station
norm_stations <- stations %>% 
  group_by(station) %>% 
  mutate(perc_OTUs = (no_of_OTUs / sum(no_of_OTUs))*100) %>%
  mutate(perc_reads = (no_of_reads / sum(no_of_reads))*100)

# make custom theme for graphs
cust_theme <- theme_dark() + 
  theme(axis.text.y = element_text(angle=90, hjust=0.5), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1), strip.text.x = element_text(angle=180),
        panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), 
        strip.placement = "outside", 
        panel.background = element_rect(fill = "white"),
        strip.background =element_blank(), 
        panel.spacing = unit(0, "lines"),
        axis.ticks = element_blank())

# order stations so they appear in the desired order
norm_stations$station <- factor(norm_stations$station, c("station46", "station49", "station121", "station45", "station50", "station126"))

# plot richness/diversity
ggplot(norm_stations, aes(x = group, y = perc_OTUs)) + 
    geom_bar(aes(fill = station), color="white", stat = "identity", position = "dodge") +
    scale_fill_manual(values = c("station121" = "#F8766D", "station126" = "#00BFC4", "station45" = "#0B8CDB", "station46" = "#E07B58", "station49" = "#E05893", "station50" = "#0BDBA8")) +
    xlab("") + ylab("% of OTUs") +
    ggtitle("Diversity") +
    cust_theme

# plot abundance
ggplot(norm_stations, aes(x = group, y = perc_reads)) + 
    geom_bar(aes(fill = station), color="white", stat = "identity", position = "dodge") +
    scale_fill_manual(values = c("station121" = "#F8766D", "station126" = "#00BFC4", "station45" = "#0B8CDB", "station46" = "#E07B58", "station49" = "#E05893", "station50" = "#0BDBA8")) +
    xlab("") + ylab("% of reads") +
    ggtitle("Abundance") +
    cust_theme

Questions:
12. The plots generated show relative species diversity and abundance. Why is it better to show relative values than absolute values?
13. A large proportion of OTUs could not be classified (“Unassigned”). What could they represent?
14. What is the most diverse lineage of organisms in each environment? Is it heterotrophic or photosynthetic?
15. What is the most abundant lineage of organisms in each environment? Is it heterotrophic or photosynthetic?
16. When estimating abundance of organisms, we assume that the number of reads is proportional to the number of respective organism; i.e. we assume that the species with the largest number of reads is the most abundant. Can you think of other factors that may affect the number of reads?


3.2.2 Dominant major lineages in global marine waters

Now let us take a closer look at the most dominant major lineages in the global ocean (i.e. data from all Tara Ocean stations; data obtained from the Tara Oceans Companion Website, Dataset W6).

In the treemaps generated below, each group is represented by a rectangle, and the area of each rectangle is proportional to its value. Hover and click for interactivity on the interactive treemap. Hovering over a group will also give its value :)

Most diverse lineages

# load necessary packages
library(treemap)
library(d3treeR)

# dataset
data <- read.csv("data/w6.all-tara.tsv", header=T, sep = "\t")

# basic treemap
p_diversity <- treemap(data,
            index=c("Supergroup","Major_lineage"),
            vSize="No_of_OTUs",
            type="index",
            palette = "Set2",
            bg.labels=c("white"),
            align.labels=list(
              c("center", "center"), 
              c("right", "bottom")
            )  
          )        

# make it interactive ("rootname" becomes the title of the plot):
inter_diversity <- d3tree2( p_diversity ,  rootname = "Hover and click groups for interactivity" )

inter_diversity


Questions:
17. List the three most diverse lineages in the global oceans (excluding Metazoa) as well as their ecological role in the oceans.

Most abundant lineages

# basic treemap
p_abund <- treemap(data,
            index=c("Supergroup","Major_lineage"),
            vSize="No_of_reads",
            type="index",
            palette = "Set2",
            bg.labels=c("white"),
            align.labels=list(
              c("center", "center"), 
              c("right", "bottom")
            )  
          )        

# make it interactive ("rootname" becomes the title of the plot):
inter_abund <- d3tree2( p_abund ,  rootname = "Hover and click groups for interactivity" )

inter_abund


Questions:
18. List the three most abundant lineages in the global oceans (excluding Metazoa) as well as their ecological role in the oceans.

3.2.3 Novel diversity

Before metabarcoding studies, our view of microbial diversity in the oceans was largely informed by describing morphological species that were cultured. The Tara Oceans expedition in particular helped change our view of planktonic diversity and discovered lots of novel diversity.

The plot below compares the number of reference sequences in PR2 and total Tara-Oceans V9 OTUs for a subset of groups. Note that reference sequences in PR2 are not derived from marine environments, but also include soils and freshwater organisms.

# Morphological species vs. Tara OTUs

data <- read.csv("data/pr2_otus.tsv", header=T, sep = "\t")

ggplot(data, aes(fill=Source, y=Number, x=Group)) + 
  geom_bar(position="dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Questions:
19. Why are the groups Myxogastrea and Streptophyta found relatively less in Tara Oceans data?.
20. In which groups was the most hidden diversity discovered? List three groups. Roughly how many folds was the diversity increase?


Finally, let’s investigate how similar Tara Oceans OTUs from the six chosen stations are to reference sequences in PR2.

data <- read.csv("data/similarity.tsv", header=T, sep = "\t")

ggplot(data, aes(x=perc_identity, fill=perc_identity)) + 
  geom_histogram(binwidth = 1, boundary = 0, closed = "left", color="white", fill=rgb(247,128,0, maxColorValue = 255)) +
  theme_light() + 
  xlim(50,100) +
  xlab("Similarity to reference sequence (%)") +
  ylab("Number of OTUs")

echo 'The total number of OTUs at these six stations is:'
cat data/similarity.tsv | wc -l

echo 'The number of OTUs between 80-90% similar to the reference database is:'
cat data/similarity.tsv | awk '$2 < 90' | awk '$2 > 80' | wc -l

echo 'The number of OTUs less than 80% similar to the reference database is:'
cat data/similarity.tsv | awk '$2 < 80' | wc -l
## The total number of OTUs at these six stations is:
##    57740
## The number of OTUs between 80-90% similar to the reference database is:
##    12311
## The number of OTUs less than 80% similar to the reference database is:
##     9631

Questions:
21. Sequences divergent from references represent novel diversity! Indeed, sequences that are less than 80% similar to any reference, cannot even be assigned to any major lineage or supergroup. What percentage of sequences are between 80-90% similar to references, and what percentage are less than 80% similar?


These results are exciting because they indicate how much we still have to learn about the organisms that keep our oceanic ecosystems functioning!

4 Investigating taxa of interest


Metabarcoding data can also be used to investigate the biodiversity and ecology of taxa of interest.

4.1 Symbiodinium

In this part of the lab, we will focus on one particular genus, Symbiodinium, and investigate its distribution across the six stations (coral and open ocean). Symbiodinium is a genus of photosynthetic dinoflagellates. These micro-algae, colloquially called “zooxanthellae” are most well-known for being symbionts and the primary producers in reef-building corals. They are thus ecologically and economically important. There has been much research carried out on Symbiodinium biology and ecology in recent years in order to understand global coral reef decline.

Questions:
22. Do you expect to find Symbiodinium in one or both environment types? Explain your reasoning.

Consider the following OTUs from station 49 (coral reef) which have been classified as Symbiodinium.

# station 49
cat data/vsearch_49.tsv | grep 'Symbiodinium' | tr '|' '\t' | cut -f 1,2,10,11
## otu_3007;366 100.0   SymbiodiniumCladeA  sp.
## otu_3567;1   98.4    SymbiodiniumCladeD  sp.
## otu_432448;1 97.7    SymbiodiniumCladeA  sp.
## otu_5241;692 100.0   SymbiodiniumCladeC  sp.
## otu_162983;2 100.0   SymbiodiniumCladeC  sp.
## otu_13108;6  80.8    SymbiodiniumCladeD  sp.
## otu_539869;1 89.9    SymbiodiniumCladeE  californium
## otu_288404;6 96.9    SymbiodiniumCladeD  sp.
## otu_107980;1 98.4    SymbiodiniumCladeA  sp.

Column 1 represents the OTU, Column 2 represents % similarity to closest reference, Columns 3 and 4 represent genus and species respectively.

Questions:
23. How confident can you be that otu_3007;366 is a Symbiodinium? Why?
24. How confident can you be that otu_13108;6 is a Symbiodinium? Why?

To resolve this, we will use the Evolutionary Placement Algorithm to “place” sequences in a reference tree. This is a very useful technique for classifying millions of short environmental reads or OTUs. Figure 7 shows an overview of the process.


Fig 7: Phylogenetic placement (source:Bik et al. 2012)

  1. For this, we need to first generate an 18S phylogeny from reference sequences from PR2. This has already been done for us. The tree below shows Symbiodinium sequences rooted with other dinoflagellates.

ref: tree

  1. We then “place” the OTUs from station 49 above on the reference tree using EPA. This algorithm decides the best location for each OTU without rebuilding the tree from scratch again. This has also been done for us, and the OTUs are shown in red colour.

station 49, coral

Questions:
25. How many OTUs are actually Symbiodinium? (These are OTUs that fall in the Symbiodinium clade.)
26. Are the results consistent with your answers to questions 23 and 24?

Similar analyses have been carried out for you for the other stations. View the trees here below.

station 46, coral

station 121, coral

station 45, open ocean

station 50, open ocean

station 126, open ocean

Questions:
26. Are the results consistent with your answer to question 22?


Thus far, we have looked at presence/absence of Symbiodinium in coral reef waters and open ocean waters. We can also investigate if there is any difference in abundance (i.e. number of Symbiodinium reads from each of the stations).

# read data
sym_abundance <- read.csv("data/sym_abundance.txt", header = T, sep = "\t")

# order stations in the order we wangt to appear in the graph
sym_abundance$stations <- factor(sym_abundance$stations, c("station_45", "station_46", "station_50", "station_49", "station_121", "station_126"))

# custom theme
require(ggplot2)

cust_theme <- theme_dark() + 
  theme(axis.text.y = element_text(angle=90, hjust=0.5), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1), strip.text.x = element_text(angle=180),
        panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), 
        strip.placement = "outside", 
        panel.background = element_rect(fill = "white"),
        strip.background =element_blank(), 
        panel.spacing = unit(0, "lines"),
        axis.ticks = element_blank())

# plot
ggplot(sym_abundance, aes(x = stations, y = rel_abundance)) + 
    geom_bar(aes(fill = env_type), color="white", stat = "identity", position = "dodge") +
    xlab("") + ylab("Symbiodinium abundance (% of reads)") +
    ggtitle("Symbiodinium abundance") +
    cust_theme

Questions:
27. Is there a trend in abundance of Symbiodinium?
28. Based on these results, what can we say about the lifestyle of Symbiodinium? Does it exclusively live through symbiosis with corals?
29. Does Symdiodinium have a free-living stage, or does it form symbioses with other organisms? We can test this hypothesis by looking at the presense and abundance of Symbiodinium in different size fractions on an online server. (Follow the instructions below and answer this question. Recall that Symbiodinium cells are usually between 6-13 microns in diameter.).

The Ocean Barcode Atlas is an online web service for rapidly analysing the biogeography of taxa using the entire Tara Oceans dataset. Go to the website and click on Community ecological analysis. In the taxonomy field, type in “Symbiodiniaceae”, untick computation of beta diversity, set maps and bubble plots to 1, and submit your job.

Select the following size fractions: 0.8-5 microns, 5-20 microns, 20-180 microns, 180-2000 microns. What does this tell us about the life cycle of Symbiodinium?

4.2 Chytrids

Chytridiomycota (informally chytrids) are fungi with flagellated zoospores which have recently been uncovered as important parasites or saprotrophs of diatoms in marine systems (Figure 8).

Fig 8: Scanning electron microscographs of pennate diatoms infected by chytrid-like fungi which are coloured in red (image from Kilias et al. 2020. Communications Biology)

Question:
30. Using the Ocean Barcode Atlas again, can you determine which temperature range do Chytridiomycota prefer? Are they more abundant in tropical waters?

5 Patterns of microbial community composition


5.1 Oceans

What factors explain the microbial community compositions in different environmental samples? We will now return to our six Tara Ocean stations (coral and non-coral), and quantify the differences in microbial communities between the stations, and cluster them based on their differences. The measure of dissimilarity we will use is Bray-Curtis, which is commonly used in ecology. You do not need to know how it works, however if you are interested in a short, easy to read summary, see here

Importantly, the measure is between 0 and 1. Identical communities would have a Bray-Curtis dissimilarity of 0, while communities which share no species would have a Bray-Curtis dissimilarity of 1.

# load required package
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
# read data
matrix <- read.csv("data/chosen_stations_filtered_otus.tsv", header=T, sep = "\t", row.names = 1)

# view the data we are working with. The table contains read counts for each OTU in every station. Each row = OTU, each column = station.
head(matrix)
##       station46 station49 station121 station45 station50 station126
## otu_1    160143    191068      63806    165918     48066      29592
## otu_2       505       651        322      1079       446        690
## otu_3     14895     10798      84622     37216     63754      51660
## otu_4    172454     20426      38846    188886     31332     467641
## otu_5     36236     12180     126136     70310     37317     213247
## otu_6    114638     64892      12028    187475    116723       7397
# transpose data so that rows and columns are switched
matrix_t <- t(matrix)

# normalise data
matrix_t_norm <- decostand(matrix_t, method = "total") 

# calculate Bray-Curtis distance and cluster the stations
matrix.bc.dist <- vegdist(matrix_t_norm, method = "bray")
matrix.bc.clust <- hclust(matrix.bc.dist, method = "average")
plot(matrix.bc.clust, ylab = "Bray-Curtis dissimilarity")

Questions:
31. What explains the clustering that we see?

5.2 Comparing oceans, lakes, and soils

Finally, we are going to zoom out and compare the broad microbial communities from ocean, lake, and soil samples. To get an overview of the communities, we will phylogenetically place the metabarcoding sequences onto a reference phylogeny as we did for Symbiodinium previously. This time however, we will place OTUs on a global eukaryotic phylogeny with 155 annotated, reference 18S sequences (these sequences are derived from the PR2 18S database) (Figure 8). Note that because this tree is based on 18S sequences only, it is not entirely consistent with what we know from phylogenomics. However, this does not impact phylogenetic placement.

Fig 8: Reference 18S phylogeny of eukaryotes with major groups annotated.


The three datasets we will be comparing today are the Malaspina dataset (a global marine expedition like the Tara Oceans), a freshwater dataset from Lake Baikal, and a dataset from Neotropical forest soils. We’re opting for the Malapspina dataset instead of the Tara because like the soil and freshwater data, it covers the longer, V4 region instead of the V9 region.

Let’s have a cursory look at how many OTUs are in each dataset (NB: normally, we would normalize data first before comparisons, but for the purpose of this lab, it is okay to proceed with un-normalised data.)

Marine - Malaspina

# marine
cat data/malaspina.fasta | grep -c '>'
## 8217

Neotropical forest soils

# soils
cat data/neotropical-soils.fasta | grep -c '>'
## 28474

Freshwater - lake Baikal

# freshwater
cat data/baikal.fasta | grep -c '>'
## 1125

The phylogenetic placement has already been carried out. Download the placement files from GitHub. To do so, first download the GitHub repository as a zip file: i.e. Code (in top-right corner in green) –> Download ZIP. After saving, unzip the folder and navigate to Metabarcoding_lab-master –> placement_files. You will see three files with the extension .jplace.

Now for the fun part! We will use the interactive Tree of Life website to visualize the difference between these three environments. Simply load each .jplace file (using the button “Choose file”, selecting the file, and click “Upload”). I recommend using three tabs so you can view all files at the same time.

You will now see a phylogenetic tree and a control panel. Select the following options in the control panel:

  • Mode options –> Branch lengths –> Ignore (this will make it easier to view the placements)
  • Label options –> Font style –> 26 px (or higher so you can view the tip labels)
  • Datasets –> Phylogenetic placements –> On

You will now see bubbles on the tree branches. Each bubble represents OTU placements on a particular branch. The bigger the bubble, the more OTUs belong to that particular lineage!

Questions:
32. What are the most dominant lineages in soils and freshwater samples?
33. In which environment are fungi the most dominant?
34. Which lineages are dominant in the oceans, but absent in freshwater and soils?
35. Are there lineages present in freshwater/soils but absent in oceans? Which ones? 36. List any other observations you find interesting!

5.3 Fin

Discuss your answers with your TA :)